Character Data in Programming Languages
The C char type is supposed to be large enough to store any member of the execution character set. If a genuine character from that set is stored in a char object, its value is equivalent to the integer code for the character and is non-negative. The char type is also equivalent to a single byte and may be signed or unsigned (implementation dependent). C does not actually define the size of a byte, so in principle a byte could be made large enough that a char would accommodate multi-octet characters and Unicode characters. However, in most implementations, bytes and char objects are 8 bits, and multi-octet characters require a sequence of char objects.
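A minimal sketch of this behavior, assuming an 8-bit char (the common case, not a guarantee): the two-octet UTF-8 character for U+00E9 ("é") occupies two char objects, and each octet must be read as unsigned char to recover a non-negative value. The string literal spells out the octets explicitly so the example does not depend on the source file's encoding.

#include <limits.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    /* UTF-8 encoding of U+00E9 ("é"): one character, two octets. */
    const char *s = "\xC3\xA9";

    printf("CHAR_BIT = %d\n", CHAR_BIT);                  /* usually 8 */
    printf("strlen(s) = %zu char objects\n", strlen(s));  /* 2, not 1 */

    /* Cast to unsigned char so each octet prints as a non-negative value. */
    for (size_t i = 0; i < strlen(s); i++)
        printf("byte %zu: 0x%02X\n", i, (unsigned char)s[i]);

    return 0;
}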
Instead, C provides the wide character, or wchar_t, type. This type is supposed to be large enough to hold the largest character in any extended execution set supported by the implementation (including MBCS encodings). It permits internal processing using fixed-size characters; C library functions such as mbstowcs() and wcstombs() convert between SBCS/MBCS strings and wide character strings. However, the size of wchar_t is implementation specific; although it is usually 16 or 32 bits, on some implementations it is equivalent to char.
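A hedged sketch of the conversion path (it assumes a UTF-8 locale is available from the environment, and the string and buffer sizes are illustrative): mbstowcs() decodes a multibyte char sequence into fixed-size wchar_t units suitable for internal processing.

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    /* Adopt the environment's locale so the multibyte encoding is known;
       this sketch assumes that locale uses UTF-8. */
    setlocale(LC_ALL, "");

    const char *mb = "\xC3\xA9t\xC3\xA9";  /* "été" in UTF-8: 5 char objects */
    wchar_t wide[16];

    size_t n = mbstowcs(wide, mb, 16);     /* decode into fixed-size units */
    if (n == (size_t)-1) {
        fputs("conversion failed (is a UTF-8 locale in effect?)\n", stderr);
        return EXIT_FAILURE;
    }

    printf("%zu char objects -> %zu wchar_t units (each %zu bytes)\n",
           strlen(mb), n, sizeof(wchar_t));
    return 0;
}

Note how the output makes the size caveat concrete: the same three characters cost five char objects but three wchar_t units, whose width varies by implementation.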
Java takes a different approach: bytes remain 8 bits, but a Java char is a 16-bit unit intended to contain a Unicode character (strictly, a UTF-16 code unit; characters outside the Basic Multilingual Plane require a surrogate pair of two char values).

Finally, programming languages generally provide some abstraction away from encoding details. For example, the C character constant 'A' may have the value 0x41 in an ASCII-based implementation but 0xC1 in an EBCDIC-based implementation. Nevertheless, programs may make more subtle assumptions about character encodings, such as assuming that A-Z have sequential, contiguous code points (not true in EBCDIC).
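A short illustration of that pitfall in standard C (the helper name is_upper_naive is hypothetical, introduced only for contrast): a range test hard-codes the contiguity assumption, whereas the <ctype.h> classification functions defer to the implementation's execution character set.

#include <ctype.h>
#include <stdio.h>

/* Non-portable: assumes 'A'..'Z' are contiguous, which holds in ASCII
   but not in EBCDIC, where the alphabet has gaps. */
static int is_upper_naive(char c) {
    return c >= 'A' && c <= 'Z';
}

int main(void) {
    /* 0x41 on an ASCII-based implementation, 0xC1 on an EBCDIC-based one. */
    printf("'A' has value 0x%02X here\n", (unsigned)'A');

    /* Portable: isupper() consults the implementation's character set. */
    char c = 'J';
    printf("naive: %d, isupper: %d\n",
           is_upper_naive(c), isupper((unsigned char)c) != 0);
    return 0;
}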